Cleaning

Goals of this notebook

The steps we’ll take to prepare our data:

  • Download the data
  • Import it into our notebook
  • Clean up data types and columns
  • Export the data for next notebook
  • Make sure that the updated data that covers the year 2023, aligns with the rest of the data.

Setup

Loading up the libraries:

library(tidyverse)
library(janitor)
library(readxl)

Downloading Data

Here I will download the TCEQ data that I obtained from a public records request. The data goes over the counties or jurisdictions affected by water boil notices, start and end date of a water boil notice and the nature of the water boil notice.

More information about the data can be found on the index page of this notebook.

I downloaded the file straight into my computer rather than downloading it into this notebook because its in Excel formatting. I downloaded the data into my data-raw folder.

Importing the Data

Here I am reading the data into my notebook that I already have downloaded to my “data-raw” folder. Then, we’re adding our data to a code chunk. For simplicity it will be called “water_boil” which we will then take a peak at in order to look at what our data looks like in order to clean. We add the clean_names() function so that we can see our data columns in a clean way.

water_boil <- read_excel("data-raw/water_boil_2018_2022.xlsx") |> clean_names()

water_boil

Now we will glimpse the data:

water_boil |> glimpse()
Rows: 12,867
Columns: 8
$ rn_number    <chr> "RN100209386", "RN100209386", "RN100209386", "RN100210038…
$ rn_name      <chr> "ORANGE CARBON BLACK PLANT", "ORANGE CARBON BLACK PLANT",…
$ start_date   <dttm> 2021-06-08, 2021-02-17, 2020-09-02, 2021-02-15, 2022-12-…
$ end_date     <chr> "44368", "44253", "44085", "44256", "44932", "44788", "44…
$ status       <chr> "CLOSED", "CLOSED", "CLOSED", "CLOSED", "CLOSED", "CLOSED…
$ county       <chr> "ORANGE", "ORANGE", "ORANGE", "CALHOUN", "HARRIS", "HARRI…
$ jurisdiction <chr> "REGION 10 - BEAUMONT+1", "REGION 10 - BEAUMONT", "REGION…
$ nature       <chr> "WATER OUTAGE", "LOW DISTRIBUTION PRESSURE", "NATURAL DIS…

Notes for water boil notice cleaning:

  • According to TCEQ, an RN Name is a “Regulated Entity” which is a person, organization, place, or thing that is of environmental interest to TCEQ. Where regulatory activities of interest to the agency occur or have occurred in the past. Examples are a site, facility or license.” Each RN is given an 11 digit number, which is recorded in the column “RN#” More information on RNs can be found “here”.
  • We have to change both end and start data to a date column instead of chr or dttm
  • We have to clean names of columns and data
  • The formatting of the “Jurisdiction” column is a bit weird, why do some entities include this information and others exclude this information? What constitutes a jurisdiction?

Fixing the dates

We want to make sure that our dates are in month, date and year format for consistency in our analysis. We also want to make sure that our notebook recognizes that information as dates.

Our end date was having issues parsing, so we made sure to filter out the dates that did not exist or were “NA.” After we filtered them out we can see that these dates represent on going (or active) water boil notices, and that’s why they were formatted differently from “start_date.”

water_boil_dates <- water_boil |>
  mutate(
    end_date_notice = excel_numeric_to_date(end_date |> parse_number()), .after = end_date
  )
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `end_date_notice =
  excel_numeric_to_date(parse_number(end_date))`.
Caused by warning:
! 444 parsing failures.
row col expected actual
 78  -- a number      -
120  -- a number      -
252  -- a number      -
276  -- a number      -
301  -- a number      -
... ... ........ ......
See problems(...) for more details.
water_boil_dates
##water_boil_dates |> filter(is.na(end_date_notice))

Importing Updated Data

I uploaded the new data that encompasses all water boil notices from January 2022 to December of 2023 into my data-raw folder. This data also encompasses a few boil notices from 2024.

Note: This data should be updated with newer 2024 and 2025 water boil notice data.

water_boil_new <- read_excel("data-raw/BWN_water_boil_new .xlsx") |> clean_names() 

water_boil_new 

Now we will glimpse the data:

water_boil_new |> glimpse()
Rows: 6,655
Columns: 16
$ issued                 <dttm> 2023-12-31, 2023-12-31, 2023-12-30, 2023-12-30…
$ rescinded              <dttm> 2024-01-01, 2024-01-04, 2024-01-02, 2024-01-03…
$ pws_id                 <chr> "TX1460136", "TX1830025", "TX0140006", "TX09600…
$ pws_name               <chr> "DAYTON OAKS ESTATE", "DEADWOOD WSC", "CITY OF …
$ population             <dbl> 203, 1176, 151261, 360, 8013, 7815, 1668, 2813,…
$ pws_type               <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C…
$ reason_activity        <chr> "BWN - LOW DISTRIBUTION PRESSURE", "BWN - LOW D…
$ reason                 <chr> "Due to MAINTENANCE AT THE PLANT", "Due to a br…
$ affected_area_activity <chr> "BWN AFFECTED AREA - SYSTEMWIDE", "BWN AFFECTED…
$ affected_area          <chr> "ALL", "Customers that live at 248 County Rod 4…
$ responsible_party      <chr> "JOAQUIN MONTES", "DEBBIE ALLUMS", "COREY ANDRE…
$ comments               <chr> "512-990-4400 X56109", "(903) 754-0100\r\n\r\n1…
$ region                 <chr> "12", "5", "9", "1", "12", "5", "14", "8", "10"…
$ county                 <chr> "LIBERTY", "PANOLA", "BELL", "HALL", "HARRIS", …
$ reissued_count         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ do_not_consume_use     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Cleaning names of new data

The older data and the newer data have different column names. In this new data I want to change “issued” and “rescinded” to “start_date” and “end_date.” I will change those last two columns when I change the variable to a date.

water_boil_new_names <- water_boil_new |>
  mutate(
    nature = reason
  )

water_boil_new_names |> glimpse()
Rows: 6,655
Columns: 17
$ issued                 <dttm> 2023-12-31, 2023-12-31, 2023-12-30, 2023-12-30…
$ rescinded              <dttm> 2024-01-01, 2024-01-04, 2024-01-02, 2024-01-03…
$ pws_id                 <chr> "TX1460136", "TX1830025", "TX0140006", "TX09600…
$ pws_name               <chr> "DAYTON OAKS ESTATE", "DEADWOOD WSC", "CITY OF …
$ population             <dbl> 203, 1176, 151261, 360, 8013, 7815, 1668, 2813,…
$ pws_type               <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C…
$ reason_activity        <chr> "BWN - LOW DISTRIBUTION PRESSURE", "BWN - LOW D…
$ reason                 <chr> "Due to MAINTENANCE AT THE PLANT", "Due to a br…
$ affected_area_activity <chr> "BWN AFFECTED AREA - SYSTEMWIDE", "BWN AFFECTED…
$ affected_area          <chr> "ALL", "Customers that live at 248 County Rod 4…
$ responsible_party      <chr> "JOAQUIN MONTES", "DEBBIE ALLUMS", "COREY ANDRE…
$ comments               <chr> "512-990-4400 X56109", "(903) 754-0100\r\n\r\n1…
$ region                 <chr> "12", "5", "9", "1", "12", "5", "14", "8", "10"…
$ county                 <chr> "LIBERTY", "PANOLA", "BELL", "HALL", "HARRIS", …
$ reissued_count         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ do_not_consume_use     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ nature                 <chr> "Due to MAINTENANCE AT THE PLANT", "Due to a br…

Clean dates

It looks like our “start_date” column is facing the same excel date issue. We also have to change the end_date column so that it appears as a date variable.

water_boil_dates_new <- water_boil_new_names |>
  mutate(
       start_date = mdy(issued), 
       end_date = mdy(rescinded)
  )
Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `start_date = mdy(issued)`.
Caused by warning:
! All formats failed to parse. No formats found.
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
water_boil_dates_new

Counting length of water boil notices

For my analysis I would like to look at the length of water boil notices and the counties where they are happening. In order to do this later I want to create a new column for the length of the water boil notices.

water_boil_length <- water_boil_dates_new |> 
  mutate(
    notice_length= rescinded - issued, .after = rescinded
  )

water_boil_length

Remove unnecessary column

I checked the original dataset to see if there were any rows of data in the “do_not_consume_use” column and they were all N/A. We will remove this unnecessary column with the filter option. I also want to remove the “end_date” and “start_date” column that are still in character format because I already created the new date column.

water_boil_filter <- water_boil_length |> 
  select(-c(end_date, start_date, do_not_consume_use)) 

water_boil_filter

Exporting clean data

Now that my data looks clean I will export to a new Quarto website page to start my analysis.

 water_boil_clean <- water_boil_filter|>
   write_rds("data-processed/01-water_boil.rds")